In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Getting distribution values using scipy

The scipy stats package contains a lot of useful distributions. Here, you can compute z and t statistics, etc. Normal distribution is given in scipy.stats.norm, while the t distribution is given as stats.


In [17]:
# Cumulative probability P(X<120) where X ~ N(100, 10^2)
print("P(X<120) where X ~ N(100, 10^2) = %.3f" % stats.norm.cdf(120, loc=100, scale=10))

# Calculate value
print("x for which P(X < x = 0.97) = %.1f" % stats.norm.ppf(0.97, loc=100, scale=10))


P(X<120) where X ~ N(100, 10^2) = 0.977
x for which P(X < x = 0.97) = 118.8

In [18]:
# Cumulative probability P(X<120) where X ~ N(100, 10^2)
print("P(X<120) where X ~ t with df = 10, mean = 100 and sigma = 10) = %.3f" % stats.t.cdf(120, df=10, loc=100, scale=10))

# Calculate value
print("x for which P(X < x = 0.97) = %.1f" % stats.t.ppf(0.97, df=10, loc=100, scale=10))


P(X<120) where X ~ t with df = 10, mean = 100 and sigma = 10) = 0.963
x for which P(X < x = 0.97) = 121.2

Linear regression

Let us first read the data from the csv file into a Pandas "DataFrame" using Pandas' built in parsers.


In [2]:
df = pd.read_csv("co2_temp_yr.csv", delimiter=",")
print(df)


    Year     CO2 ppm  Global Temp
0   1965  320.044167          -10
1   1966  321.383333           -4
2   1967  322.157500           -1
3   1968  323.045000           -5
4   1969  324.624167            6
5   1970  325.680000            4
6   1971  326.320000           -7
7   1972  327.453333            2
8   1973  329.676667           15
9   1974  330.176667           -7
10  1975  295.202500           -1
11  1976  332.053333          -12
12  1977  333.781667           14
13  1978  335.409167            5
14  1979  336.781667           11
15  1980  338.679167           22
16  1981  340.100000           28
17  1982  341.436667            9
18  1983  343.025000           27
19  1984  307.350000           11
20  1985  346.041667            8
21  1986  347.384167           14
22  1987  349.160833           28
23  1988  351.563333           35
24  1989  353.067500           24
25  1990  354.346667           39
26  1991  355.566667           38
27  1992  356.382500           19
28  1993  357.067500           20
29  1994  358.822500           28
30  1995  360.795833           42
31  1996  362.587500           32
32  1997  363.705000           45
33  1998  366.652500           61
34  1999  368.325833           39
35  2000  369.525000           40
36  2001  371.130000           52
37  2002  373.215000           60
38  2003  375.774167           59
39  2004  377.490833           51
40  2005  379.800000           65
41  2006  381.904167           59
42  2007  383.764167           62
43  2008  385.585833           49
44  2009  387.366667           59
45  2010  389.844167           66

Quick exploration

The DataFrame contains a lot of useful convenience functions. For example, df.describe() gives you a quick summary of useful information! See more at http://pandas.pydata.org/pandas-docs/stable/10min.html.


In [3]:
df.describe()


Out[3]:
Year CO2 ppm Global Temp
count 46.000000 46.000000 46.000000
mean 1987.500000 349.592391 26.108696
std 13.422618 23.034181 23.586463
min 1965.000000 295.202500 -12.000000
25% 1976.250000 330.645834 6.500000
50% 1987.500000 350.362083 25.500000
75% 1998.750000 367.907500 44.250000
max 2010.000000 389.844167 66.000000

You can also do quick plotting of the data. The results are not aesthetically the best, but it is useful for a quick visual of the data


In [4]:
ax = df.plot(x="CO2 ppm", y="Global Temp", style='o')


Much nicer results can be obtained using a dedicated plotter like seaborn.


In [5]:
ax = sns.regplot(x="CO2 ppm", y="Global Temp", data=df)


Performing the regression

We will use scipy's built in regression analysis here. There are quite a number of options out there, e.g., statsmodels, scikit-learn, etc., that you can explore.


In [6]:
res = stats.linregress(df["CO2 ppm"], df["Global Temp"])
print("Slope = %.3f" % res.slope)
print("Intercept = %.3f" % res.intercept)
print("R = %.3f" % res.rvalue)
print("Std error = %.3f" % res.stderr)


Slope = 0.928
Intercept = -298.244
R = 0.906
Std error = 0.065

ANOVA anslysis

We can also perform ANOVA analysis using pandas + scipy. Again, let us now read the polymer.csv data into a DataFrame.


In [7]:
df = pd.read_table("polymer.csv", delimiter=",", index_col=0)
print(df)


         0    1    2    3    4    5
-1.00  8.2  8.0  8.2  7.9  8.1  8.0
-0.75  8.3  8.4  8.3  8.2  8.3  8.1
-0.50  8.9  8.7  8.9  8.4  8.3  8.5
 0.00  8.5  8.7  8.7  8.7  8.8  8.8
 0.50  8.8  9.1  9.0  8.7  8.9  8.5
 1.00  8.6  8.5  8.6  8.7  8.8  8.8

Each row in the above data is one group.


In [8]:
print(stats.f_oneway(*df.as_matrix()))


F_onewayResult(statistic=18.882352941176471, pvalue=1.8177090258544406e-08)

In [ ]:


In [ ]: